Abstract

The FITS (Flexible Image Transport System) data format has been the de facto data format for astronomy-related 
data products since its inception in the late 1970s. While the FITS file format is widely supported, it lacks many of 
the features of more modern data serialization, such as the Hierarchical Data Format (HDF5). The HDF5 file format 
offers considerable advantages over FITS, such as improved I/O speed and compression, but has yet to gain widespread 
adoption within astronomy. One of the major holdbacks is that HDF5 is not well supported by data reduction software 
packages and image viewers. Here, we present a comparison of FITS and HDF5 as a format for storage of astronomy 
datasets. We show that the underlying data model of FITS can be ported to HDF5 in a straightforward manner, and 
that by doing so the advantages of the HDF5 file format can be leveraged immediately. In addition, we present a software 
tool, fits2hdf, for converting between FITS and a new ‘HDFITS’ format, where data are stored in HDF5 in a FITS-like 
manner. We show that HDFITS allows faster reading of data (up to 100x the speed of FITS in some use cases), and improved 
compression (higher compression ratios and higher throughput). Finally, we show that by only changing the import 
lines in Python-based FITS utilities, HDFITS formatted data can be presented transparently as an in-memory FITS 
equivalent. 
1. Introduction 

The Flexible Image Transport System (FITS) file format 
has enjoyed several decades of widespread usage within 
astronomy (Wells and Greisen, 1979; Greisen et al., 1980). 
The ubiquity enjoyed by FITS has been attributed in part 
to the guiding maxim “once FITS, always FITS”: that 
changes to the FITS standard must be incremental so as to 
never break backward compatibility (Greisen, 2003). For 
this reason, among others, it is familiar to many 
generations of astronomers, and the large ecosystem of software 
that has been created over the years has in turn 
motivated further adoption of the standard. In particular, the 
CFITSIO library (ascl:1010.001, Pence, 1999, 2010) for the 
reading and writing of FITS files has become the de facto 
standard. 

FITS has necessarily evolved over the years, with the 
addition of features such as random groups (Greisen and 
Harten, 1981), ASCII tables (Harten et al., 1988), binary 
tables (Cotton et al., 1995), and compression (Pence, 2002; 
Seaman et al., 2007; Pence et al., 2009; Seaman et al., 
2010). As the culmination of these additions, the FITS file 
format is now officially at version 3.0 (Pence et al., 2010). 
However, these changes have been relatively minor 
iterations upon the core FITS format. The “once FITS, always 
FITS” maxim limits what modifications can be made; the 
guiding principle that has made FITS so successful can 
now be seen as limiting its applicability. 

The limitations of FITS are succinctly summarized in 
Thomas et al. (2014) and Thomas et al. (this issue). As 
the size of data products increases, new paradigms for data 
processing become increasingly important (Kitaeff et al., 
this issue). For example, the Large-Aperture Experiment 
to Detect the Dark Ages (LEDA, Greenhill and Bernardi, 
2012) produces 24 TB per day, and the Canadian 
Hydrogen Intensity Mapping Experiment (CHIME, Bandura 
et al., 2014) produces 4 TB per day in its pathfinder alone. 
Future telescopes, such as the Square Kilometre Array 1 
(SKA), will produce over 10x the current global internet 
traffic (SKA Organization, 2015). Distributing such 
massive data volumes is prohibitive and requires significant 
amounts of data reduction to be done in real-time, with 
high-throughput data compression and massively parallel 
data access. FITS is not well-equipped to deal with these 
challenges. 

Several authors have proposed alternative serializations 
that have advantages over FITS. In Kitaeff et al. (this 
issue), the authors consider JPEG2000 as an alternative 
format for images and data cubes. Thomas et al. (2001) 
discuss advantages of converting FITS files to XML; Jennings 
et al. (1995) considered HDF4 (Hierarchical Data Format) 
as a format. Work toward an HDF5-based format for 
astronomy was proposed by Wise et al. (2011), but funding 
for this was not secured. 

Motivated by data volumes, HDF5 has also been 
proposed or implemented for the LOFAR radio 
telescope (Anderson et al., 2010), the CCAT telescope 
(Schaaf et al., in press), the CHIME pathfinder (Masui 
et al., this issue), and the MeerKAT telescope (HDF Group, 
2015d), among others. These implementations share a 
common file format, but the data are organized in 
differing ways as there is no agreed-upon method. 

Here, we discuss the immediate, practicable advantages 
of HDF5 as an alternative serialization format for the FITS 
data model. We show that the data model inherent to the 
FITS file format can be converted in a straightforward 
manner to HDF5, and that by doing so better 
compression results and faster read speeds can be achieved. This 
work extends that presented in the proceedings of 
Astronomical Data Analysis Software and Systems (ADASS) XXIV 
(Price et al., in press). 

1.1. Definitions 

In order to discuss data storage methods and formats 
without ambiguity, we first need to clarify our vocabulary: 

• Data model: a high-level, conceptual model of data, 
types of data, and how data are organized, e.g. 
‘group’ and ‘dataset’. 

• Data schema: a lower-level, domain-specific ontology 
(i.e. framework that gives meaning) of how data and 
metadata are arranged inside a data model. For 
example, a schema may define a set of rules for the 
names of attributes and datasets, and how data are 
organized within the data model. 

• Serialization, or storage model: how objects from the 
data model are mapped to bytes within an address 
space on storage media (such as a hard drive). 

• File format: a well-defined serialization for a data 
model. 

• Convention: a documented data schema that has 
widespread acceptance within a community of users. 

• Standard: the acknowledged, formal specification of 
a file format. A standard may or may not define 
acceptable data models and schema, but should provide 
an application programming interface (API). 

From this view, the data model can be seen as syntax, 
while the data schema may be seen as the ontology that 
gives semantics. Without a well-defined schema, the 
underlying meaning of the dataset may be unclear. Neither 
the FITS nor HDF5 standards define data schema; 
however, there are registered FITS conventions² for certain 
classes of data. For FITS files, the data model is closely 
tied to the storage model; in contrast, HDF5 allows 
abstract data models that are divorced from the subtleties of 
the storage model. 

1.2. Motivation 

The goal of the work presented here is to create an 
HDF5-based equivalent of the FITS file format, and to 
provide utilities for converting between the two formats. 
The motivation for this approach, as opposed to creating 
an HDF5-based format from scratch, is that decades of 
widespread FITS usage have left a legacy that would 
otherwise be discarded. By preserving the familiar underlying 
data model of FITS, software packages designed to read 
and interpret FITS can be readily updated to read HDF5 
data. Maintaining backwards-compatibility with FITS, so 
that data stored in HDF5 files can be converted into FITS 
for use in legacy software packages, is another persuasive 
reason to pursue a FITS-like data model within HDF5. 

A switch to an HDF5-based format has several 
advantages, many of which are immediately practicable. 
Perhaps the most compelling for next-generation datasets is 
that HDF5 supports far more compression filters than 
FITS; some comparisons of compression are shown in 
Section 5. For the data tested here, HDF5 compressors 
outperform FITS equivalents, although we note a larger 
cross-section of astronomical data must be compared to 
form definitive conclusions. Another compelling reason to 
switch to HDF5 is I/O speed. HDF5 allows for efficient 
reading of portions of a dataset, whether they are 
contiguous or a regular pattern of points or blocks. Additionally, 
HDF5 has parallel I/O support, which is becoming 
increasingly important for efficient processing of large datasets on 
multi-node systems. 
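
As a minimal sketch of the partial reads described above, the following uses the h5py library to read a contiguous block and a regularly strided selection from a chunked, compressed dataset without loading the whole array. The dataset name, shape, and chunk size are illustrative assumptions, and an in-memory file is used so that nothing is written to disk.

```python
import h5py
import numpy as np

# Illustrative 2-D integer "image"; the shape and values are arbitrary.
full = np.arange(512 * 512, dtype=np.int32).reshape(512, 512)

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("partial_read_demo.h5", "w", driver="core",
               backing_store=False) as f:
    dset = f.create_dataset("image", data=full, chunks=(64, 64),
                            compression="gzip")

    # Contiguous block: only the chunks overlapping this region are read.
    block = dset[100:164, 200:264]

    # Regular pattern: every 8th row and every 16th column.
    strided = dset[::8, ::16]

print(block.shape, strided.shape)
```

Decompression happens transparently inside the slicing call, so the calling code looks the same whether or not a compression filter is applied.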

A good example of porting a data model is provided by 
Jenness (this issue), in which the Hierarchical Data 
System (HDS) format has been reimplemented in HDF5; 
similarly, Jenness et al. (this issue) discuss conversion to and 
from FITS and the HDS-based NDF (N-dimensional data 
format). Together, the ability to convert FITS to NDF 
and the reimplementation of HDS within HDF5 provide 
an alternative path toward conversion of FITS to an 
HDF5-based format. A comparison of the two approaches is given 
in Section 6. 

A port of the FITS data model to HDF5 does not, 
however, address issues with the FITS data model itself. 
Nevertheless, as the HDF5 data model is abstracted from its 
file format, an HDF5-based version of the FITS data model 
can be extended without requiring changes to the 
storage model. The HDF5-based FITS equivalent, as detailed 
here, can be used as a starting point and as a testbed for 
enhancing the FITS data model. Using the fits2hdf 
utility (described in Section 4), a user 
can convert their data into HDF5 and convert it back into 
FITS if required. Our hope is that this provides a means 
for the astronomy community to investigate the 
advantages and disadvantages of moving away from FITS. 
1.3. Overview 

This article is organized as follows. In Section 2, we 
provide a comparison of the FITS and HDF5 file formats, with 
emphasis on their data models. In Section 3 we present a 
mapping of the FITS data model into the HDF5 abstract 
data model, which we call ‘HDFITS’; Section 4 details software 
with which to convert FITS to and from the HDFITS 
format. A comparison of compression and read speed on 
equivalent datasets in both formats is then given in 
Section 5. This is followed by a discussion of the benefits and 
then some concluding remarks. 


2. Comparison of FITS and HDF5 data models 

One of the main differences between FITS and HDF5 
is that FITS does not abstract the data model from the 
storage model; that is, there is a simple correspondence 
between the data model and the serialization. A FITS 
file is composed of a set of Header-Data Units (HDUs), 
which are ASCII headers followed by contiguous blocks 
of data (binary or ASCII encoded). In comparison, an 
HDF5 file is organized as a directed graph, and objects 
from the data model are mapped to the storage model. By 
use of B-tree data structures, data may be discontiguous, 
allowing insertions and resizing of datasets. HDF5 also 
allows ‘hierarchy’, whereby an object can be placed within 
a group, and nested groups of objects are allowed. 

Here, we present a brief comparison of the two data 
models; we refer the reader to Pence et al. (2010) and 
HDF Group (2015b) for further details of the two file 
formats. 

2.1. FITS data model 

The data model of FITS is closely related to the 
FITS file format itself. Over the years, changes to the 
format have allowed the data model to evolve, without 
significant change to the storage model. For example, header 
keywords may now be longer than 8 characters, but 
during serialization keywords are stored in such a way that each 
keyword in the header remains no longer than 8 characters. 
Here, we give a short overview of FITS, with an emphasis 
on the data model. 

Header data unit. A FITS file is comprised of segments 
known as Header-Data Units (HDUs). Each HDU consists 
of an ASCII ‘header unit’ made up of keyword-value pairs 
(known as ‘cards’), which may be followed by an optional 
‘data unit’. The header unit consists of metadata that 
describes the structure of the data unit and the contents 
of the file. At a minimum, a FITS file contains one HDU, 
which is referred to as the ‘Primary HDU’; HDUs after this 
are referred to as ‘Extension HDUs’. A file with multiple 
HDUs is referred to as a Multi-Extension FITS (MEF), 
otherwise it is known as a Single-Image FITS (SIF). Due to 
the implementation of FITS compression, a SIF file must 
be converted into a MEF file in order to apply the 
compression filter. As such, SIF files are becoming less common. 


SIMPLE  =   T / file conforms to FITS standard
BITPIX  =  16 / number of bits per data pixel
NAXIS   =   2 / number of data axes
NAXIS1  = 440 / length of data axis 1
NAXIS2  = 300 / length of data axis 2


Figure 1: Example of a FITS header unit, showing 
keyword, value, and comment structure. Here, whitespace has 
been trimmed. 


Header unit. The FITS header unit is an ASCII-formatted 
list of keyword-value pairs and short associated comments 
(Fig. 1). These keyword-value pairs are used to describe 
and document the data contained within the data unit 
(if present); that is, they are metadata. For example, a 
header unit may contain labels for array dimensions, 
coordinate system information, and information about the 
instrument from which the data originate. Depending upon 
the type of data unit, there are some mandatory keywords 
that must be present. 

Within a FITS file, each line in the header must be no 
more than 80 characters long, with each keyword in 
all-caps and at most 8 characters long; corresponding values 
must be no longer than 68 characters. This can be 
considered a serialization quirk, as the data model has been 
extended by the use of special keywords, with two 
common variants: the HIERARCH keyword can be used to allow 
keywords up to 64 characters long, and the CONTINUE 
keyword allows values to span multiple lines. Each entry 
in the header unit is referred to as a ‘card’, so that the 
header unit can be considered an ordered list of cards. 
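
To make the card structure concrete, the following is a simplified sketch of how a single fixed-format card might be split into keyword, value, and comment. The helper names parse_card and make_card are hypothetical and are not part of astropy or CFITSIO; the sketch deliberately ignores HIERARCH and CONTINUE cards, and a ‘/’ inside a quoted string.

```python
def parse_card(card):
    """Split one FITS header card into (keyword, value, comment).

    Simplified: assumes a 'KEYWORD = value / comment' layout and does not
    handle HIERARCH, CONTINUE, or '/' characters inside quoted strings.
    """
    if len(card) > 80:
        raise ValueError("a FITS card may not exceed 80 characters")
    card = card.ljust(80)                  # pad short input to a full card
    keyword = card[:8].rstrip()
    if card[8:10] != "= ":
        # Commentary cards (COMMENT, HISTORY) carry free text only.
        return keyword, None, card[8:].strip()
    value_text, _, comment = card[10:].partition("/")
    value_text = value_text.strip()
    if value_text in ("T", "F"):           # logical values
        value = value_text == "T"
    elif value_text.startswith("'"):       # quoted string values
        value = value_text.strip("'").rstrip()
    else:
        try:
            value = int(value_text)
        except ValueError:
            value = float(value_text)
    return keyword, value, comment.strip()


def make_card(keyword, value_text, comment=""):
    """Build a padded 80-character card from already-formatted value text."""
    text = "{:<8}= {:>20}".format(keyword, value_text)
    if comment:
        text += " / " + comment
    return text.ljust(80)


cards = [
    make_card("SIMPLE", "T", "file conforms to FITS standard"),
    make_card("BITPIX", "16", "number of bits per data pixel"),
    make_card("NAXIS1", "440", "length of data axis 1"),
]
header = [parse_card(c) for c in cards]   # an ordered list of parsed cards
```

Keeping the result as a list rather than a dictionary preserves the card ordering that is part of the FITS header model.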

Two other cards that may be present within the header 
unit are the COMMENT and HISTORY cards, which allow 
plaintext comments and notes about the file’s history. 
Long comments and history are created via multiple cards; 
again, this is more a serialization detail than an aspect of 
the data model. 


Data Unit. There are three classes of data unit (known 
as ‘extensions’) that may be stored in FITS: the IMAGE 
extension, which stores images and N-dimensional data; 
TABLE, which is used to store ASCII-formatted tables; and 
BINTABLE, which stores tables in a more efficient binary 
format and unlike the TABLE extension can store arrays 
of data. There is also a ‘random group’ data unit which 
is now deprecated but still used for radio interferometer 
data, for historical reasons. If compression is applied to 
an IMAGE extension, it is converted into a BINTABLE. 


Datatypes. The type of data within an HDU is specified in 
mandatory header cards, and is limited to: 8-bit unsigned 
integers; 16, 32 and 64-bit signed integers; 32 and 64-bit 
IEEE 754 floating point; and 7-bit ASCII (ANSI 1977) 
data. Boolean and bit data may also be stored. With the 
exception of 8-bit data, serialization of unsigned integers is 
not supported, but this may be circumvented by the use 
of an accompanying scale offset keyword (BZERO). 
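
The offset workaround can be illustrated with a short sketch, assuming the usual convention of BZERO = 32768 for unsigned 16-bit data and a scale factor of one. The function names are hypothetical; the physical value is recovered as the stored value plus BZERO.

```python
import struct

BZERO = 32768  # conventional offset for unsigned 16-bit data (BITPIX = 16)


def encode_uint16(values):
    """Serialize unsigned 16-bit values as big-endian signed 16-bit integers."""
    stored = [v - BZERO for v in values]      # shift into [-32768, 32767]
    return struct.pack(">{}h".format(len(stored)), *stored)


def decode_uint16(raw):
    """Recover physical values: physical = BZERO + stored (scale factor 1)."""
    stored = struct.unpack(">{}h".format(len(raw) // 2), raw)
    return [BZERO + s for s in stored]


physical = [0, 1, 32767, 32768, 65535]
raw = encode_uint16(physical)                 # 2 bytes per value on disk
restored = decode_uint16(raw)
```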
World Coordinate Systems. An integral part of FITS is its 
ability to store metadata detailing the mapping between 
coordinates within an image and physical (i.e., world) 
coordinate systems (WCS). Coordinate systems are specified 
via keywords in the header unit and must follow the 
coordinate system definitions (Calabretta and Greisen, 2002; 
Greisen and Calabretta, 2002; Calabretta and Roukema, 
2007; Greisen et al., 2006; Rots et al., 2015). WCS 
information allows FITS viewers to interpret the coordinates 
that correspond to each pixel and thus overlay graticules 
and other information. 

FITS conventions. Higher-level data models do exist for 
FITS in the form of ‘conventions’³, which prescribe a set 
of header keywords, values, and table/image structures, 
generally for a domain-specific application. For example, 
UVFITS (Greisen, 2012) and FITS-IDI (Greisen, 2008) are 
conventions for the storage of data from radio 
interferometers, and the SDFITS convention (Garwood, 2000) was 
designed for storage of data from single dish observations. 

2.2. HDF5 data model 

HDF5 employs an abstract data model that is designed 
to conceptually cover many different models. Unlike FITS, 
HDF5 allows hierarchy within files, meaning for example 
that a group may be contained within a group. Here, we 
briefly introduce aspects of HDF5 that are directly relevant 
when considering how to port the FITS data model into 
HDF5; the HDF5 data model is described in further detail 
in HDF Group (2015b). 

Group. An HDF5 group is simply a collection of objects 
(a group is itself an object). Each HDF5 file contains a 
root group, which may contain zero or more other groups. 
Every object within an HDF5 file, apart from the root 
group, must be a member of at least one group. 
Conceptually, a FITS HDU can be considered a group containing 
a header unit and a data unit. 

Dataset. An HDF5 dataset is a multidimensional array of 
elements of a given datatype. A dataset could consist of 
a simple datatype (e.g. integers), or a composite datatype 
consisting of several different kinds of data element. An 
HDF5 dataset is similar to a FITS data unit. 

Datatype. The HDF5 datatype is a description of a 
specific class of data element. Atomic datatype elements 
include strings, floats, integers and bitfields. Composite 
datatypes, such as an array, are formed by combining 
multiple atomic data elements. HDF5 also allows for users to 
define custom datatypes, for example a 17-bit integer. All 
the datatypes supported by FITS are included within the 
standard predefined HDF5 datatypes. 


³ A registry of FITS conventions is available at 
http://fits.gsfc.nasa.gov/fits_registry.html 


Attributes. HDF5 attributes are similar to FITS header 
cards, in that they are keyword-value metadata pairs 
used to describe data. Both HDF5 groups and datasets 
may have attributes attached to them. HDF5 attributes 
differ from FITS header cards in that the data stored in an 
attribute may be composite datatypes such as arrays; 
further, HDF5 attributes are not stored as an ordered list. 

Dataspace. An HDF5 dataspace is a description of the 
dimensions of a multidimensional array. A dataspace 
provides a similar description of array dimensions to the cards 
within a FITS header that detail dimensions. 

Dimension scales. An HDF5 dimension scale is a 
1-dimensional dataset that provides information about the 
dimensions of a given dataspace. This is analogous to the 
WCS-related cards within a FITS header. 

Existing specifications. There are two HDF5 specifications 
that are relevant for the storage of astronomy data: the 
IMAGE (HDF Group, 2015a) and TABLE (HDF Group, 
2015c) specifications. These documents provide a standard 
method for storing image and tabular data, respectively. 
These specifications define mandatory attributes that describe 
the properties of each image or table. Such datasets are 
distinguished from other datasets via the attribute CLASS. 

FITS has shown that it is possible to store almost all 
astronomy datasets in either a table or as an N-dimensional 
image. While this may be true, there is likely significant 
advantage to defining further specifications (new classes or 
even sub-classes) that may be more appropriate and 
provide more semantic meaning. We suggest that 
classes and subclasses should remain abstract, such as 
‘time series’ or ‘sparse matrix’, in contrast to higher-level 
conventions for domain-specific data (e.g. ‘single dish 
observation’). 

3. Porting FITS to HDF5 

There are myriad ways in which FITS data could be 
stored within HDF5. We will use the portmanteau 
‘HDFITS’ to refer to data stored in HDF5 with a FITS-like 
data model. In order to port the FITS data model to 
HDF5, we first need to decide upon how best to create 
an HDU-like object within HDF5. There are a number of 
possible approaches, such as: 

• Single ‘HDU’ dataset with attributes. A HDU is 
created from a single dataset object. The values 
contained in the FITS header unit are mapped to 
attributes attached to the dataset, and the data payload 
is stored in the dataset’s dataspace. Comments and 
history are also stored as attributes. 

• A ‘HDU’ group with a ‘header’ dataset and a ‘data’ 
dataset. A HDU-like object is created by placing two 
datasets within a group. Header values, comments, 
and history are stored within a table in the ‘header’ 
dataset’s dataspace, and the main data payload is 
stored in the ‘data’ dataset’s dataspace. 

• A ‘HDU’ group with ‘header’ attributes and a ‘data’ 
group. A HDU-like object is created by placing a child 
dataset (or group) within a parent group. Header 
values are stored in the parent group’s attributes; 
‘comment’ and ‘history’ datasets are also placed within the 
parent group. A number of datasets may be placed 
within the ‘data’ group; for example, each column in 
a table could be stored as a dataset. 

We have implemented the last of these (Figure 2), as we have 
found it to be intuitive, while allowing flexibility for the 
data model to be extended in the future by addition of 
other datasets or groups. 

Figure 2: Diagram showing the mapping of the FITS data structure (for a binary table) into the HDF5 data model, 
where each column is stored as a dataset within a group. The FITS HDU comprises a header (CARD1, CARD2, 
COMMENT, HISTORY) and data (COL1 to COL4); the HDF5 group comprises a header (ATTR1 to ATTR3, COMMENT, 
HISTORY) and data (DSET1 to DSET4). Mapping key (FITS to HDF5): HDU to group; card to attribute; column to 
dataset. Reproduced with modifications from Price et al. (in press). 

The next step is to define how HDU objects are arranged 
within the HDF5 file. As HDF5 allows hierarchy, it would 
allow groups of related HDUs to be stored; we elected to 
enforce that all HDUs are located in the root group, as 
otherwise the data model would not be compatible with 
FITS. Unlike FITS, HDF5 does not enforce ordering of 
groups⁴; to reproduce the FITS ordering, each group must have an 
attribute identifying its position within the HDU list. 

We use the attribute CLASS=HDU to identify header data 
units, and the attribute CLASS=HDFITS in the root of the 
file to identify the file as a HDFITS file. We further use the 
keyword CLASS to distinguish images and table datasets. 
There is no need to constrain attribute keywords to be 
all-caps and 8 characters in HDF5, but for backwards 
compatibility we suggest that this is appropriate, at least for a first 
HDFITS revision. The structure of an example HDFITS 
file is shown in Figure 3. Note that in HDF5, objects may 
have multiple names, because groups and datasets are assigned 
names by the links that reference them; we enforce that 
each group and dataset has a single name that identifies 
it. 

⁴ The default ordering is increasing lexicographic by attribute 
name. One might instead track by attribute / link creation time, 
but this is less transparent. 

HDF5 dimension scales provide a more flexible way of 
describing coordinate systems than the FITS WCS header 
keywords, because entire datasets, complete with their 
own attributes, may be linked to label the scale of each 
dimension. Since these are 1-dimensional, this is still 
insufficient for cases where pixel mappings are not parallel 
to the axes (for example, spherical coordinates). In such 
a case, a parallel array of dimensional mappings would 
be more appropriate. For backwards-compatibility with 
FITS, we have not implemented dimension scales or 
further deviations in this initial version of HDFITS. 

3.1. FITS headers 

There are a number of approaches one can take to 
implementing an equivalent of the FITS header within HDF5; 
we opted to map them to HDF5 attributes. One could 
also consider storing the header in a dataset, perhaps 
even maintaining the 80-character per line card structure. 
The latter approach would be advantageous for 
‘round-tripping’ (conversion from FITS to HDF5 and back 
again), as the header could be kept identical and intact. 
The disadvantage is precisely that this does not parse the 
header into HDF5 attributes, which means knowledge of 
how to parse the FITS header is required for 
understanding of the corresponding data unit. Creation of HDFITS 
files from scratch would also require creation and storage 
of superfluous FITS header cards. 

Another approach would be to store keyword, value, 
and comment as columns of a tabular dataset. 
Nonetheless, this still requires understanding and parsing the FITS 
header. Storing metadata in attributes is more in keeping 
with the HDF5 abstract data model, 
hence our implementation. Further discussion of 
round-tripping is given in Section 4.1 below. 

3.2. Image HDUs 

There is an existing HDF5 IMAGE specification 
(HDF Group, 2015a) for the storage of image data, which 
we reuse here for the storage of N-dimensional datasets as 
are stored in FITS IMAGE HDUs. An image dataset is 
distinguished by an attribute CLASS=IMAGE attached to the 
dataset. We note that semantically, much data stored in 
FITS image HDUs is not an image at all, but rather an 
N-dimensional dataset. Future implementations of HDFITS 
may elect to create a class NDDATA, and to differentiate 
between different kinds of data (e.g. images, spectral cubes), 
using a SUBCLASS attribute. 

3.3. Table HDUs 

An HDF5 specification for the storage of data tables 
already exists (HDF Group, 2015c), but we have chosen an 
alternative implementation, where CLASS=COLUMN datasets 
are stored within a group with attribute CLASS=DATA_GROUP. 
Each column within a DATA_GROUP has attributes to 
describe its data, such as its position in the table (POSITION 
attribute), and the units (UNIT attribute) of the contained 
data. Our motivation for this is that compression 
algorithms work better on non-compound datatypes, and also 
that this allows for columns to be added and deleted without 
requiring dataset resizing. Additionally, data analysis may 
be orders of magnitude faster if performed on data held in 
‘column stores’ than it can on data held in ‘row stores’ 
(Abadi et al., 2008). 

The fits2hdf and hdf2fits programs (detailed 
below) parse both the HDF5-specified CLASS=TABLE and the 
CLASS=DATA_GROUP to create an in-memory table object, 
which can then be written back to either a CLASS=TABLE 
or CLASS=DATA_GROUP object. 
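
The layout described in this section can be sketched with h5py as follows. This is an illustration under stated assumptions rather than the output of fits2hdf: the group names (PRIMARY, CATALOG), the dataset name DATA, the example keyword values, and the use of POSITION as the name of the HDU ordering attribute are all hypothetical. Only the CLASS values and the column attributes POSITION and UNIT are taken from the text.

```python
import h5py
import numpy as np

with h5py.File("hdfits_demo.h5", "w", driver="core",
               backing_store=False) as f:
    f.attrs["CLASS"] = "HDFITS"              # marks the file as HDFITS

    # Image HDU: header cards become attributes of the HDU group.
    primary = f.create_group("PRIMARY")      # hypothetical group name
    primary.attrs["CLASS"] = "HDU"
    primary.attrs["POSITION"] = 0            # hypothetical ordering attribute
    primary.attrs["TELESCOP"] = "EXAMPLE"    # an ordinary header keyword
    image = primary.create_dataset("DATA", data=np.zeros((300, 440), "i2"))
    image.attrs["CLASS"] = "IMAGE"

    # Table HDU: one dataset per column inside a DATA_GROUP.
    table = f.create_group("CATALOG")        # hypothetical group name
    table.attrs["CLASS"] = "HDU"
    table.attrs["POSITION"] = 1
    columns = table.create_group("DATA")
    columns.attrs["CLASS"] = "DATA_GROUP"
    for pos, (name, unit, values) in enumerate([
            ("RA", "deg", np.array([10.5, 11.2])),
            ("FLUX", "Jy", np.array([1.3, 0.7]))]):
        col = columns.create_dataset(name, data=values)
        col.attrs["CLASS"] = "COLUMN"
        col.attrs["POSITION"] = pos
        col.attrs["UNIT"] = unit

    # Recover the FITS-like ordering from the position attributes,
    # since HDF5 itself iterates members by name.
    hdus = [g for g in f.values() if g.attrs.get("CLASS") == "HDU"]
    hdu_order = [g.name for g in
                 sorted(hdus, key=lambda g: g.attrs["POSITION"])]
    column_order = [c.name.split("/")[-1] for c in
                    sorted(columns.values(),
                           key=lambda c: c.attrs["POSITION"])]
```

Because each column is its own dataset, a column can be added or removed by creating or deleting one dataset, without resizing the others.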

4. The fits2hdf software package 

We have implemented a software package called 
fits2hdf⁵, which converts FITS files into HDFITS files. 
This utility is written in Python, and uses the astropy 
(ascl:1304.002, Astropy Collaboration et al., 2013) library 
for FITS I/O, and the h5py⁶ library for HDF5 I/O. As the 
HDFITS data model is a restricted subset of the complete 
HDF5 data model, any HDFITS file may be converted 
back into a FITS file without complication; we provide a 
utility, hdf2fits, for this purpose. 

Internally, fits2hdf converts both HDFITS and FITS 
data into intermediary in-memory data objects. These 
objects are subclasses of the astropy NDData and Table 
classes. The astropy library is designed to create data 
objects such as these that are abstracted from the details 
of the underlying serialization. For legacy reasons, the 
astropy FITS handling routines do not directly read into 
these abstract classes, but future releases intend to 
rectify this (Astropy Collaboration, 2015). Astropy does not 
provide an abstracted ‘HDU list’ object, nor objects to 
store comments and history; these are instead defined in 
fits2hdf. Future development of fits2hdf will attempt 
to align with the development path of astropy, and if 
possible, all of the fits2hdf functionality will be transferred 
to the main astropy package. 

⁵ https://github.com/telegraphic/fits2hdf 

⁶ http://www.h5py.org/ 

Figure 3: A tree diagram showing an example 
implementation of a HDFITS data model. 

4.1. Round-trip FITS conversion 

A perfectly lossless conversion from FITS into HDFITS, 
and back into FITS, should produce an output file that is 
byte identical to the input. We stress that this is not the 
case with fits2hdf, and so caution that it should not be used 
for data archiving without careful comparison of input and 
output data. 

There are several reasons why input and output 
files differ. The first is simply that fits2hdf adds its own 
comments to the HISTORY table. A second reason is that 
many FITS files contain comments on mandatory header 
cards, such as 

SIMPLE = T / This file is a valid FITS file 

BITPIX = 16 / Number of bits per pixel 

that encode information meant for the FITS reader, not 
the user. These cards, while necessary, are automatically 
generated when creating FITS HDUs with astropy; HDF5 
does not have an equivalent. As such, preserving such 
information is not constructive, and would require adding 
useless attributes to the HDF5 file. Put another way, 
FITS cards that describe dataspaces and datatypes are 
discarded, and only FITS header cards that provide valuable 
metadata to the end user are kept. 
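
The distinction between structural cards and user metadata can be sketched as a simple filter. The keyword pattern below is an assumption chosen for illustration and is not the list used by fits2hdf; split_header is a hypothetical helper name.

```python
import re

# Keywords that describe the layout of the data unit rather than the
# observation. Illustrative only; not taken from the fits2hdf source.
STRUCTURAL = re.compile(
    r"^(SIMPLE|BITPIX|NAXIS\d*|EXTEND|XTENSION|PCOUNT|GCOUNT|TFIELDS"
    r"|TFORM\d+|TTYPE\d+|END)$"
)


def split_header(cards):
    """Separate (keyword, value) pairs into structural and user metadata."""
    structural, metadata = {}, {}
    for keyword, value in cards:
        target = structural if STRUCTURAL.match(keyword) else metadata
        target[keyword] = value
    return structural, metadata


cards = [("SIMPLE", True), ("BITPIX", 16), ("NAXIS", 2), ("NAXIS1", 440),
         ("NAXIS2", 300), ("TELESCOP", "EXAMPLE"), ("OBSERVER", "A. N. Other")]
structural, metadata = split_header(cards)
```

In this sketch only the entries in metadata would become HDF5 attributes; the structural information is already carried by the HDF5 dataspace and datatype.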

Similarly, table keywords such as TUNIT are stripped and 
parsed by fits2hdf and their information added to the 
astropy column object. Units that are not formatted as 
suggested in the FITS standard (e.g. DEGREES in lieu of 
DEG) are fixed where possible. 

A further reason is that FITS files can store tabular data in 
random groups (which are deprecated in the FITS 
v3.0 standard), ASCII tables, or binary tables. The 
intermediary data format used within fits2hdf does not 
differentiate between these different table serializations, and 
on export all tabular data is written to the more efficient 
binary table extension. 

Nonetheless, the values of the data contained within a 
HDU’s data portion will remain unchanged. We provide 
a utility called fits2fits in the root directory of the 
fits2hdf package, to facilitate comparison between an 
original FITS file and its ‘round-tripped’ counterpart. 

5. Performance comparison 

Performance comparisons of FITS and HDFITS were 
conducted on a MacBook Pro late 2013 model with a 
solid-state disk. Results in Section 5.2 are from data stored on 
a Western Digital 1 TB hard disk (WDC WD1000CHTZ-0). 
Both HDF5 and FITS files were read into memory by 
Python 2.7.6 scripts which were run within the IPython 
v1.1 environment. HDF5 files were read with h5py v2.3.1, 
with HDF5 v1.8.13, and FITS files were read using the 
astropy v1.0 FITS I/O library. To test FITS compression, 
we used the FPACK utility (ascl:1010.002, Seaman et al., 
2010), provided as part of the CFITSIO v3.3 package. 

As FITS does not support parallel I/O, we use the serial 
version of HDF5 for comparison purposes; as such, no 
parallel read/write tests were performed. Fair comparison of 
differing lossy compression schemes is more involved than 
we wish to undertake here. As FITS applies lossy 
compression to floating point data, we only consider integer 
data compression ratios. 

5.1. Compression 

Both FITS and HDF5 support compression: FITS via
the FPACK utility and HDF5 via its filter pipeline. A
notable feature of HDF5 is that data can be compressed
and decompressed automatically and transparently to the
end user. Different filters, such as pre-compression
shuffling of data, can be chained together to enhance achieved
compression ratios. Also notable is the speed: the
lossless bitshuffle compression algorithm has been shown
to achieve throughputs in excess of 1 GB/s on a single
core while maintaining good compression ratios, by
exploiting AVX2 and SSE2 instructions that are available on
x86 processors. A near-lossless version of the bitshuffle
algorithm has been designed specifically for radio
interferometric datasets. The CHIME pathfinder (Masui et al.,
this issue) has implemented this algorithm for real-time
data compression and storage, and reports a compression
ratio of 3.57x, a write speed of 773 MiB/s and a read speed
of 1147 MiB/s.
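
To illustrate why a pre-compression shuffle can help, the following sketch applies a simple byte-level shuffle to small-range 32-bit integers before compressing them. It is a toy model only: zlib stands in for the compressor, the shuffle is byte-level rather than the bit-level transposition used by bitshuffle, and the function names are hypothetical.

```python
import random
import struct
import zlib


def byte_shuffle(buf, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def byte_unshuffle(buf, itemsize):
    """Invert byte_shuffle so the original element order is restored."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)


# Small-range integers stored as little-endian 32-bit values: the upper
# three bytes of each element carry almost no information.
random.seed(0)
values = [random.randint(-2**7, 2**7) for _ in range(20000)]
raw = struct.pack("<%di" % len(values), *values)

plain = zlib.compress(raw)                      # compress as stored
shuffled = zlib.compress(byte_shuffle(raw, 4))  # shuffle, then compress

print("unshuffled: %d bytes, shuffled: %d bytes" % (len(plain), len(shuffled)))
```

Shuffling places the near-constant high-order bytes next to each other, which gives a general-purpose compressor long, repetitive runs to work with.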


We used the fits2hdf software package to create
equivalent datasets in FITS and HDF5 formats, and then
compared write speed and lossless compression ratios. On
generated datasets containing random integers, bitshuffle
consistently outperformed FITS and standard GZIP (Table
1), in both compression ratio and speed. We used random
integers with bounds (-2^N, 2^N), where N = 7, 15, 23 and 31,
and stored these data as 32-bit integers. While uniformly
distributed integers are in theory incompressible, the
entire dynamic range of the 32 bits is not exercised, so it is
possible to compress these data by precisely 4x, 2x, 1.25x
and 1x.

FITS compression was performed using FPACK. Here, we
report the results using the default Rice compression filter
with standard options. The HDF-based filter LZF, used
with byte-level shuffling and scale-offset options (LZF_SS
in table), performed slower than bitshuffle but faster
than the FITS Rice-based compression algorithm.

Compression tests were also performed on tabular data
(Table 2). We selected three large (>500 MB) FITS files
containing binary tables from the Sloan Digital Sky
Survey Data Release 12 (SDSS DR12, Alam et al., 2015), and
compared the compression of LZF_SS, bitshuffle and
FITS Rice compression⁷. In our tests, LZF_SS performed
significantly faster than FITS Rice compression and
achieved a higher compression ratio, while bitshuffle ran
the fastest but only achieved a modest compression ratio.

5.2. Data access 

Read speed is a major consideration for a file format, and
is becoming progressively more important as average dataset
sizes increase. A feature of HDF5 is that data can be stored
either in contiguous blocks or in discontiguous 'chunks'. If
data access patterns are known, a significant speedup in
read performance can be achieved, as only chunks that
contain relevant data need to be accessed; however, each
such chunk must be read in its entirety. If compression is
also used, only data within the chunk being accessed need be
decompressed, and access to the raw data remains
transparent to the end user.

To benchmark real-world performance of HDF5 against
FITS, we generated a 3-dimensional dataset with dimension
sizes (10000, 200, 200), consisting of random 32-bit
integer data in the range (-2^23, 2^23). We stored these data
in FITS, HDF5, and HDF5 compressed with bitshuffle,
resulting in files of size ~1.5 GB, ~1.5 GB, and ~1.2 GB. For
the HDF5 file, we specified a chunk size of (1000, 10, 10).
During read tests, each read was repeated 16 times and
multiple copies of the file were used so that data were not
read from cache.
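
The effect of chunking on partial reads can be estimated by counting how many chunks a given selection overlaps. The sketch below does this for the dataset shape and chunk size quoted above; the two example selections and the helper name chunks_touched are hypothetical and are not the access patterns used in the benchmark.

```python
from math import prod


def chunks_touched(chunk_shape, selection):
    """Count the chunks overlapped by a hyperslab.

    selection holds one (start, stop) pair per axis, with stop exclusive.
    """
    counts = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of first chunk on this axis
        last = (stop - 1) // size    # index of last chunk on this axis
        counts.append(last - first + 1)
    return prod(counts)


shape = (10000, 200, 200)
chunk = (1000, 10, 10)

# Ceiling division per axis gives the number of chunks in the file.
total_chunks = prod(-(-n // c) for n, c in zip(shape, chunk))

whole_cube = chunks_touched(chunk, [(0, n) for n in shape])
one_plane = chunks_touched(chunk, [(42, 43), (0, 200), (0, 200)])
one_spectrum = chunks_touched(chunk, [(0, 10000), (7, 8), (7, 8)])

print(total_chunks, whole_cube, one_plane, one_spectrum)
```

With this chunk shape, a single plane at fixed first index overlaps 400 of the 4000 chunks, while a full-length column along the first axis overlaps only 10, which is why matching the chunk shape to the expected access pattern matters.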

When reading these data back in their entirety, average
read times were 14.3, 15.7 and 13.9 s, for FITS,
uncompressed HDF5, and bitshuffle-compressed HDF5,


7 Compression of binary tables is an experimental feature within 
FPACK, enabled by the -table flag 
Image details                  FITS RICE        HDF5 LZF_SS      HDF5 Bitshuffle  GZIP
data range (2^N)  image size   ratio  time (s)  ratio  time (s)  ratio  time (s)  ratio  time (s)
7                 2048 x 2048  3.52   0.15      3.55   0.12      3.84   0.04      2.47   3.64
15                2048 x 2048  1.87   0.17      1.92   0.17      1.98   0.04      1.46   3.01
23                2048 x 2048  1.28   0.21      1.33   0.17      1.32   0.04      1.13   0.82
31                2048 x 2048  0.99   0.24      1.00   0.12      1.00   0.05      1.00   0.55
7                 8192 x 8192  3.54   2.03      3.55   1.82      3.84   0.49      2.47   59.08
15                8192 x 8192  1.88   2.55      1.92   2.49      1.98   0.55      1.46   49.63
23                8192 x 8192  1.28   3.08      1.33   2.85      1.32   0.64      1.13   13.91
31                8192 x 8192  0.99   3.38      1.0    1.95      1.00   0.68      1.0    9.37

Table 1: Compression ratios on uniformly distributed random integers.

File details                                        FITS RICE        HDF5 LZF_SS      HDF5 Bitshuffle
file name                                size (MB)  ratio  time (s)  ratio  time (s)  ratio  time (s)
photoMatchPlate-dr12.fits                628.3      2.22   29.30     2.56   4.76      1.51   3.24
seguetsObjSetAllDup-0338-3138-0101.fits  144.4      1.86   8.49      4.60   1.99      1.69   0.96
ssppOut-dr12.fits                        1913.3     2.94   79.86     3.03   22.59     1.62   14.88

Table 2: Compression ratios on randomly selected SDSS DR12 binary tables data.


respectively. Without chunking, HDF5 performs
comparably to FITS, indicating that disk read speed is the major
limitation. The read times for a pair of randomly chosen
slices along the slowest-varying dimension (i.e. along the
z-axis in a 3D data cube) were 21.1 s, 0.24 s, and 0.21 s,
in the same order.

6. Discussion 

6.1. The importance of abstraction 

By abstracting the data model away from file
serialization, one may focus on improving and extending the data
model, without being bogged down by implementation
issues. Our hope is that the work presented here will
facilitate the creation of a robust, community-endorsed data
model that is broadly applicable within astronomy. This
data model must obey agreed standards regarding issues
such as coordinate systems, units, and uncertainties.
Another benefit of abstraction is that when next-generation
file formats inevitably appear, the community can adapt to
use them effectively, without needing to completely rewrite
existing software packages. Indeed, community discussion
as to the future of astronomical data formats has already
begun in earnest; see for example Mink et al. (in press).

6.2. HDF5 performance 

As shown in Section 5.2, HDF5 outperformed FITS in
data read speed by almost two orders of magnitude, for the
case where only data along the slowest-varying axis of a
data cube are read. In contrast, when reading the entire
dataset, FITS reads slightly faster (10%) than chunked,
uncompressed HDF5 files. While full characterization of
the effect of data access pattern upon read performance is
beyond the scope of this paper, the results presented here
are representative of best and worst-case scenarios. For
any application where portions of a larger dataset are read,
a chunked HDF5 file is likely to give better read performance.

In terms of compression, the HDF5 LZF and bitshuffle
compression algorithms achieve higher throughput and
compression ratios than FITS Rice compression.
However, we have not compared lossy compression algorithms.
When compressing floating point data, FITS applies a
pre-compression scaling filter based upon the noise present in
the image, and also applies a 'subtractive dithering'
technique (Seaman et al., 2007; Pence et al., 2009). The HDF5
scaleoffset pre-compression filter provides similar
functionality to the FITS scaling filter, but there is currently
no equivalent to the subtractive dithering. Such
functionality could be added to HDF5 by porting the subtractive
dithering algorithm from FPACK into an HDF5 filter.
Alternative compression schemes, such as those that underlie
JPEG2000, could also be ported to HDF5.
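
For readers unfamiliar with the idea, the sketch below shows the general principle of subtractive dithering: a reproducible pseudo-random offset is added before floating point values are quantized to integers, and the same offset is subtracted on restoration. This is a generic illustration under stated assumptions, not FPACK's implementation; the fixed scale, the seed handling, the use of Python's random module as the dither source, and the function names are all hypothetical.

```python
import random


def quantize(values, scale, seed=0):
    """Quantize floats to integers, adding a dither before rounding."""
    rng = random.Random(seed)
    out = []
    for x in values:
        r = rng.random()                       # dither in [0, 1)
        out.append(round(x / scale + r - 0.5))
    return out


def restore(quantized, scale, seed=0):
    """Undo quantize() by regenerating the same dither sequence."""
    rng = random.Random(seed)
    return [(q - rng.random() + 0.5) * scale for q in quantized]


data = [0.1 * i for i in range(100)]
scale = 0.25

q = quantize(data, scale)
back = restore(q, scale)
worst = max(abs(a - b) for a, b in zip(data, back))
print("largest restoration error:", worst)
```

Because the dither is regenerated from the seed, only the integers and the seed need to be stored, and the restoration error of each value stays within half a quantization step.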

6.3. Alternative approaches 

An important conclusion from the work presented here is 
that it is possible to decouple the FITS data model from 
the FITS file format. In this section, we discuss some 
alternative approaches and recent work on data models 
within astronomy. 

Starlink. The Starlink package (ascl:1110.012) provides a
utility fits2ndf, which converts a FITS file to NDF
format. As Jenness (this issue) reimplemented the underlying
HDS format of NDF within HDF5, one can use fits2ndf
to convert a file into the new HDF5-based NDF format.
NDF files are distinguished by the attribute CLASS=NDF in
the root group. The NDF data model defines optional
array components, such as variance estimates (VARIANCE)
and pixel quality (QUALITY), making it more extensive
than the HDU-based data model employed in FITS. When
converting a FITS file to NDF, the fits2ndf program
stores the header cards as-is (i.e. as a string); this
differs from the approach of HDFITS, where cards are parsed
and converted to HDF5 attributes.
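
The difference between the two approaches can be made concrete with a simplified header card parser. The sketch below splits an 80-character card into keyword, typed value and comment, so that the value could be stored as a typed attribute rather than as part of a string. It is a minimal illustration that ignores many details of the FITS standard (continued strings, complex values, Fortran-style exponents); parse_card and cards_to_attributes are hypothetical names, not part of fits2hdf or fits2ndf.

```python
def parse_card(card):
    """Split one FITS header card into (keyword, value, comment)."""
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # Commentary cards (COMMENT, HISTORY, blank) carry free text only.
        return keyword, None, card[8:].strip()
    body = card[10:].strip()
    if body.startswith("'"):
        # String value: find the closing quote; '' is an escaped quote.
        i = 1
        while i < len(body):
            if body[i] == "'":
                if body[i + 1:i + 2] == "'":
                    i += 2
                    continue
                break
            i += 1
        value = body[1:i].replace("''", "'").rstrip()
        comment = body[i + 1:].partition("/")[2].strip()
        return keyword, value, comment
    raw, _, comment = body.partition("/")
    raw = raw.strip()
    if raw == "T":
        value = True
    elif raw == "F":
        value = False
    else:
        value = raw                  # fall back to the raw string
        for cast in (int, float):
            try:
                value = cast(raw)
                break
            except ValueError:
                pass
    return keyword, value, comment.strip()


def cards_to_attributes(cards):
    """Build a {keyword: typed value} mapping, skipping commentary cards."""
    attrs = {}
    for card in cards:
        keyword, value, _ = parse_card(card)
        if value is not None:
            attrs[keyword] = value
    return attrs
```

A mapping of this kind could then be written out one entry at a time as attributes of the corresponding dataset or group.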

As discussed in Jenness et al. (this issue), there is much
to be learnt from the NDF data model, just as there is from
FITS. With both NDF and FITS data models ported to
use the HDF5 file format, both files can be read via the
HDF5 API. One then only has to consider the differences
between the data models, without concern for the minutiae
of the serialization.

MeasurementSets. The MeasurementSet (MS), used by
the CASA reduction package (ascl:1107.013), is a common
file format for visibility data in radio astronomy (Kemball
and Wieringa, 2015). The MS storage model is a directory
consisting of several data files nested inside child
directories. MS has no in-built compression capability, but does
support chunking and caching, and has a query language for
data selection. The MS standard defines data schema for
images, visibility data, and single-dish data.

VOTable. The International Virtual Observatory Alliance
(IVOA) VOTable⁸ format also addresses some of the issues
of FITS. VOTable is based upon Extensible Markup
Language (XML). By itself, VOTable does not support
features such as chunking and compression of binary data,
but it can be used to store the metadata required to set up
a socket-based data stream.

ASDF. Another alternative format that is currently in
development is the Advanced Science Data Format⁹ (ASDF),
which combines human-readable metadata with raw
binary data (similar to FITS), and is designed primarily as
an interchange format. By design, the ASDF file format
is simple, so lacks features available in the more
complex and abstract HDF5, such as chunking and parallel
I/O support.

JPEG2000. Kitaeff et al. (this issue) present impressive
lossless compression ratios (3-4x) on astronomy image
datasets using JPEG2000. The case for JPEG2000 is
particularly compelling for cloud-based data reduction
approaches, where small portions of extremely large datasets
may be sent to a client's browser via the JPEG2000
Interactive Protocol (JPIP). Nevertheless, JPEG2000 is not
flexible enough to store astronomy data products such as
interferometer visibilities or large data tables.

6.4. Adding HDF5 support to existing packages

HDF5 implements a high-level API with C, C++,
Fortran, and Java interfaces. In addition, Python, IDL,
Mathematica and MATLAB are all already able to read and
write HDF5 files; by extension, these languages can also

8 http://www.ivoa.net/documents/VOTable/ 

9 http://asdf-standard.readthedocs.org/en/latest/ 





Figure 4: Screenshot of HDFITS formatted data as
displayed in the Ginga viewer. The image shown is the iconic
"pillars of creation" taken with the Hubble Space
Telescope.

read and interpret HDFITS files. However, it should be
noted that the HDF5 API provides only a low-level
approach to loading data.

A major advantage of maintaining a FITS-like data
model is that minimal changes are required to add
support for HDF5 to existing software packages. This is
because the application programming interface (API) does not
necessarily need to change. This is well evidenced by the
work of Jenness (this issue), who reimplemented the
Hierarchical Data System (HDS) using HDF5, by producing an API
for HDF5 I/O that is near-identical to the existing HDS
API.

The equivalent of this for FITS would be to
reimplement CFITSIO to interface with HDF5 instead of
FITS. We have implemented a proof-of-concept API in
fits2hdf that reimplements the open() function from
astropy.io.fits. By doing so, we were able to get
the Python-based FITS viewers Ginga (ascl:1303.020) and
Glue (ascl:1402.002) to read HDFITS files simply by
changing the import statement from

from astropy.io import fits 

to 

from fits2hdf import pyhdfits as fits 

A screenshot of Ginga displaying HDF5 data is shown in 
Figure 4. 
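
One way an application might select between such interchangeable back ends at run time is sketched below. The helper import_first_available is a hypothetical name introduced for illustration; it simply tries each module name in turn using the standard importlib machinery, and is not part of fits2hdf or astropy.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append("%s: %s" % (name, exc))
    raise ImportError("no backend available (%s)" % "; ".join(errors))


# Example use (commented out so the sketch runs without either package):
# fits = import_first_available(["fits2hdf.pyhdfits", "astropy.io.fits"])
```

The rest of the application can then keep referring to the returned module as fits, regardless of which back end was loaded.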

6.5. Real-time data storage in radio astronomy 

The fits2hdf package is being used in the LEDA
experiment (Greenhill and Bernardi, 2012), to convert raw
interferometric data to bitshuffle-compressed HDF5 files.
The LEDA correlator computes the cross product of 512
inputs of 2400 frequency channels, resulting in an output
data rate of 2.5 GB per accumulation; at the current
integration length of 9 s, this results in ~24 TB/day.
Compression and file serialization is performed in approximately
1/3 of real time, with an achieved lossless compression
ratio of ~1.8 (55% of original size).
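
The quoted data rates can be checked with a few lines of arithmetic. In the sketch below, the first three numbers are taken from the text; the second half is an assumption that is not stated in the text, namely that every pair of inputs (including autocorrelations) yields one 8-byte complex sample per frequency channel.

```python
SECONDS_PER_DAY = 86400

bytes_per_accumulation = 2.5e9   # 2.5 GB per accumulation, from the text
integration_s = 9                # integration length in seconds, from the text
compression_ratio = 1.8          # achieved lossless ratio, from the text

raw_tb_per_day = bytes_per_accumulation * SECONDS_PER_DAY / integration_s / 1e12
compressed_fraction = 1 / compression_ratio
compressed_tb_per_day = raw_tb_per_day * compressed_fraction

# Assumption (not stated in the text): each pair of the 512 inputs,
# autocorrelations included, gives one 8-byte complex sample per channel.
n_inputs, n_channels, bytes_per_sample = 512, 2400, 8
n_pairs = n_inputs * (n_inputs + 1) // 2
estimated_bytes = n_pairs * n_channels * bytes_per_sample

print("raw: %.1f TB/day" % raw_tb_per_day)
print("compressed: %.1f TB/day" % compressed_tb_per_day)
print("estimated accumulation size: %.2f GB" % (estimated_bytes / 1e9))
```

Under that assumption the estimated accumulation size comes out close to the quoted 2.5 GB, and the daily volume follows directly from the integration length.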
6.6. Improving HDFITS

Our motivation for HDFITS was to provide a path
toward adoption of HDF5 within the astronomy community.
Extending the data model will require input and
discussion within the community, to ensure that all needs and
requirements are met.

That being said, there are several modifications that we 
propose would enhance HDFITS. The first is to provide 
better support for columns of data with masked values. 
This could perhaps be added using HDF5 dimension scales, 
which would also be a more flexible and appropriate way of 
associating coordinate scales with a dataset. Secondly, a 
specification for addressing uncertainties is sorely missing. 
We also suggest that documentation (hypertext or LaTeX),
and relevant source code could be included to enhance data 
provenance. With community coordination, many of the 
limitations documented in Thomas et al. (this issue) could 
be addressed. 

7. Conclusions 

The FITS file format has been an integral part of
astronomy data analysis for over 35 years. It has created
an ecosystem of software that has greatly benefitted the
astronomical community. That being said, FITS is
ill-equipped to deal with the challenges that ever-increasing
data volumes impose. The proposed HDFITS standard
as introduced here offers immediate advantages, is more
future-proof, and maintains the core aspects of the FITS
data model. We have shown that an HDF5-based format
achieves higher throughput and better lossless
compression ratios than FITS, and also offers faster read access
via dataset chunking.